The Effect of Smoking on Vitamin B12 Deficiency

0.1 Preliminaries

library(knitr); library(rmdformats)

## Global options
opts_chunk$set(comment = NA,
               warning = FALSE,
               message = FALSE)
#opts_knit$set(width=75)
library(arm)
library(leaps)
library(tableone)
library(pander)
library(ROCR)
library(pROC)
library(skimr)
library(modelr)
library(forcats)
library(car)
library(rms)
library(broom)
library(tidyverse)

1 Task 1: Data Source

For Project 1, I will be using NHANES 2013-14 data, which is available at https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2013. I will be utilizing several data files from NHANES, including demographics (DEMO_H), Vitamin B12 laboratory measures (VITB12_H), BMI (BMX_H), Alcohol intake (ALQ_H), Smoking (SMQ_H), Supplements (DSQTOT_H), and Dietary intakes (DR1TOT_H, DR2TOT_H). I plan to investigate the effect of smoking on vitamin B12 level and the presence of B12 deficiency, adjusting for other known risk factors for vitamin B12 deficiency, in adults greater than the age of 40 years.

NHANES was intended to be a survey of the general health and nutrition of the US population. Per the NHANES documentation (https://wwwn.cdc.gov/nchs/data/series/sr02_162.pdf), sampling methods utilized a complex, multistage probability sampling design in order to identify a population representative of the civilian, non-institutionalized US population. Upon agreement to participate in the study, study staff performed in-home screening, interviews, laboratory collection, and physical examination.

Due to sample size requirements, I chose to use a complete case analysis for the variables of interest in the NHANES 2013-2014 dataset, in adults greater than 40 years of age, with n = 1000.

2 Task 2: Load and Tidy the Data

2.1 Data Load

The NHANES data is contained in multiple csv files, which are imported and merged into a single dataframe (nhanes01) in the chunk below.

# Loading in individual csv files from NHANES website
demographics <- read.csv(file="DEMO_H.csv",na.strings=c(" .","9")) 
tbl_df(demographics)
# A tibble: 10,175 x 3
    SEQN RIAGENDR RIDAGEYR
   <int>    <int>    <int>
 1 73557        1       69
 2 73558        1       54
 3 73559        1       72
 4 73560        1       NA
 5 73561        2       73
 6 73562        1       56
 7 73563        1        0
 8 73564        2       61
 9 73565        1       42
10 73566        2       56
# ... with 10,165 more rows
bmi <- read.csv("BMX_H.csv",na.strings=" .") 
tbl_df(bmi)
# A tibble: 9,813 x 2
    SEQN BMXBMI
   <int>  <dbl>
 1 73557   26.7
 2 73558   28.6
 3 73559   28.9
 4 73560   17.1
 5 73561   19.7
 6 73562   41.7
 7 73563   NA  
 8 73564   35.7
 9 73566   26.5
10 73567   22.0
# ... with 9,803 more rows
smk <- read.csv("SMQ_H.csv", na.strings=c(" .","9")) 
tbl_df(smk)
# A tibble: 7,168 x 2
    SEQN SMQ020
   <int>  <int>
 1 73557      1
 2 73558      1
 3 73559      1
 4 73561      2
 5 73562      1
 6 73564      2
 7 73565      1
 8 73566      1
 9 73567      1
10 73568      2
# ... with 7,158 more rows
etoh <- read.csv("ALQ_H.csv", na.strings=c(" .",""))
tbl_df(etoh)
# A tibble: 5,923 x 2
    SEQN ALQ101
   <int>  <int>
 1 73557      1
 2 73558      1
 3 73559      1
 4 73561      1
 5 73562      1
 6 73564      2
 7 73566      1
 8 73567      1
 9 73568      1
10 73571      2
# ... with 5,913 more rows
dr1tot <- read.csv("DR1TOT_H.csv", na.strings=c(" ."))
tbl_df(dr1tot)
# A tibble: 9,813 x 2
    SEQN DR1TVB12
   <int>    <dbl>
 1 73557    2.79 
 2 73558   21.4  
 3 73559    3.78 
 4 73560    8.76 
 5 73561    8.30 
 6 73562    1.68 
 7 73563   NA    
 8 73564    3.30 
 9 73566    2.73 
10 73567    0.790
# ... with 9,803 more rows
dr2tot <- read.csv("DR2TOT_H.csv", na.strings=c(" .",""))
tbl_df(dr2tot)
# A tibble: 9,813 x 2
    SEQN DR2TVB12
   <int>    <dbl>
 1 73557     2.62
 2 73558     4.24
 3 73559     4.96
 4 73560     3.49
 5 73561     9.11
 6 73562    NA   
 7 73563    NA   
 8 73564    13.9 
 9 73566    NA   
10 73567    NA   
# ... with 9,803 more rows
dsqtot <- read.csv("DSQTOT_H.csv", na.strings=c(" .",""))
tbl_df(dsqtot)
# A tibble: 10,175 x 3
    SEQN DSD010AN DSQTVB12
   <int>    <int>    <dbl>
 1 73557        2    NA   
 2 73558        2    NA   
 3 73559        2    18.0 
 4 73560        2    NA   
 5 73561        2    NA   
 6 73562        2    NA   
 7 73563        2    NA   
 8 73564        1    40.0 
 9 73565        2    NA   
10 73566        1     6.00
# ... with 10,165 more rows
b12 <- read.csv("VITB12_H.csv", na.strings=c(" .",""))
tbl_df(b12)
# A tibble: 5,736 x 2
    SEQN LBDB12
   <int>  <int>
 1 73557    524
 2 73558    507
 3 73559    732
 4 73561    225
 5 73562    750
 6 73564    668
 7 73566    378
 8 73567    194
 9 73568    528
10 73571    421
# ... with 5,726 more rows
# Merging into one dataframe
m1 <- merge(demographics,bmi,by="SEQN")
m2 <- merge(m1,smk,by="SEQN")
m3 <- merge(m2,etoh,by="SEQN")
m4 <- merge(m3,dr1tot,by="SEQN")

m5 <- merge(m4,dr2tot,by="SEQN")
m6 <- merge(m5,dsqtot,by="SEQN")
m7 <- merge(m6,b12,by="SEQN")

str(m7)
'data.frame':   5735 obs. of  11 variables:
 $ SEQN    : int  73557 73558 73559 73561 73562 73564 73566 73567 73568 73571 ...
 $ RIAGENDR: int  1 1 1 2 1 2 2 1 2 1 ...
 $ RIDAGEYR: int  69 54 72 73 56 61 56 65 26 76 ...
 $ BMXBMI  : num  26.7 28.6 28.9 19.7 41.7 35.7 26.5 22 20.3 34.4 ...
 $ SMQ020  : int  1 1 1 2 1 2 1 1 2 2 ...
 $ ALQ101  : int  1 1 1 1 1 2 1 1 1 2 ...
 $ DR1TVB12: num  2.79 21.45 3.78 8.3 1.68 ...
 $ DR2TVB12: num  2.62 4.24 4.96 9.11 NA 13.9 NA NA 7.65 5.61 ...
 $ DSD010AN: int  2 2 2 2 2 1 1 1 2 2 ...
 $ DSQTVB12: num  NA NA 18 NA NA 40 6 NA NA NA ...
 $ LBDB12  : int  524 507 732 225 750 668 378 194 528 421 ...
summary(m7)
      SEQN          RIAGENDR        RIDAGEYR         BMXBMI     
 Min.   :73557   Min.   :1.000   Min.   :19.00   Min.   :14.10  
 1st Qu.:76160   1st Qu.:1.000   1st Qu.:33.00   1st Qu.:24.00  
 Median :78712   Median :2.000   Median :48.00   Median :27.80  
 Mean   :78672   Mean   :1.524   Mean   :48.37   Mean   :29.04  
 3rd Qu.:81166   3rd Qu.:2.000   3rd Qu.:63.00   3rd Qu.:32.40  
 Max.   :83727   Max.   :2.000   Max.   :80.00   Max.   :82.90  
                                                 NA's   :75     
     SMQ020         ALQ101         DR1TVB12         DR2TVB12      
 Min.   :1.00   Min.   :1.000   Min.   : 0.000   Min.   :  0.000  
 1st Qu.:1.00   1st Qu.:1.000   1st Qu.: 1.975   1st Qu.:  1.900  
 Median :2.00   Median :1.000   Median : 3.580   Median :  3.530  
 Mean   :1.57   Mean   :1.299   Mean   : 4.871   Mean   :  4.775  
 3rd Qu.:2.00   3rd Qu.:2.000   3rd Qu.: 6.030   3rd Qu.:  5.960  
 Max.   :2.00   Max.   :9.000   Max.   :94.770   Max.   :107.040  
 NA's   :2      NA's   :493     NA's   :560      NA's   :1141     
    DSD010AN        DSQTVB12            LBDB12       
 Min.   :1.000   Min.   :    0.16   Min.   :   18.0  
 1st Qu.:2.000   1st Qu.:    6.00   1st Qu.:  381.0  
 Median :2.000   Median :   16.67   Median :  514.0  
 Mean   :1.903   Mean   :  220.77   Mean   :  637.5  
 3rd Qu.:2.000   3rd Qu.:   50.00   3rd Qu.:  708.0  
 Max.   :9.000   Max.   :15100.00   Max.   :26801.0  
                 NA's   :3909       NA's   :289      
nhanes01 <- m7

2.2 Tidying, Data Cleaning and Data Management

After importing the NHANES individual csv files and merging them to create my dataframe (initially called nhanes01), I further tidied the data by renaming continuous variables, renaming and releveling categorical variables, and creating an indicator variable for the binary outcome of interest (vitamin B12 deficiency), as detailed below:

  • ID: SEQN was renamed to id.

  • Age: RIDAGEYR was renamed to age.

  • Vitamin B12: LBDB12 was renamed to vit_b12, which is the subject’s vitamin B12 level in pg/ml.

  • Supplemental B12: DSQTVB12 was renamed to supp_b12.

  • Dietary B12: The mean of dietary intake of B12 on days 1 and 2 of evaluation, DR1TVB12 and DR2TVB12 (respectively), was calculated to create the variable diet_b12.

  • BMI: BMXBMI was renamed to bmi, then BMI categories (bmi_cat) were created for usage as a multicategorical predictor (Underweight (BMI<18.5), Normal (BMI 18.5-25.0), Overweight (BMI 25.0-30.0), and Obese (BMI>30.0) per CDC definitions of each category. The final BMI variable for my dataset is called bmi_cat.

  • Smoking: SMQ020, which was originally coded as 1 = Yes and 2 = No, was recoded as a binary indicator variable for whether a subject had smoked 100 cigarettes in his or her life (1 = Yes, 0 = no). This new variable is called smk100.

  • Alcohol: ALQ101 (whether a subject has consumed at least 12 alcoholic beverages in the past year) was renamed to etoh. It was originally coded as 1 = Yes and 2 = No, and was therefore recoded as a binary indicator variable (1 = Yes, 0 = No). Values in NHANES data of 9 indicated “Don’t know”, so these values were converted to NA.

  • Antacid use: DSD010AN in the original data, which asks the subject if he/she has used antacids in the past 30 days (with original coding 1 = Yes, 2 = No, 7 = Don’t know, and 9 = Refused), was recoded to a binary indicator variable where 1 = Yes, 0 = No, and the subjects who responded “Don’t know” or “refused” was coded as NA.

  • Vitamin B12 deficiency: I created a binary variable to indicate whether a subject had vitamin B12 deficiency, if vitamin B12 level (vit_b12) was <300.

## Renaming continuous variables
# ID
nhanes01$id_int <- nhanes01$SEQN
nhanes01$id <- as.character(nhanes01$id_int)
Hmisc::describe(nhanes01$id)
nhanes01$id 
       n  missing distinct 
    5735        0     5735 

lowest : 73557 73558 73559 73561 73562, highest: 83721 83723 83724 83726 83727
# BMI
nhanes01$bmi <- nhanes01$BMXBMI
Hmisc::describe(nhanes01$bmi)
nhanes01$bmi 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    5660       75      400        1    29.04    7.651    19.90    21.30 
     .25      .50      .75      .90      .95 
   24.00    27.80    32.40    38.21    42.60 

lowest : 14.1 14.2 15.2 15.4 16.1, highest: 70.1 71.5 74.1 77.5 82.9
# Age
nhanes01$age <- nhanes01$RIDAGEYR
Hmisc::describe(nhanes01$age)
nhanes01$age 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    5735        0       62        1    48.37    20.67       21       24 
     .25      .50      .75      .90      .95 
      33       48       63       74       80 

lowest : 19 20 21 22 23, highest: 76 77 78 79 80
# Vitamin B12
nhanes01$vit_b12 <- nhanes01$LBDB12
Hmisc::describe(nhanes01$vit_b12)
nhanes01$vit_b12 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    5446      289     1066        1    637.5    415.5    248.2    293.0 
     .25      .50      .75      .90      .95 
   381.0    514.0    708.0    979.5   1259.0 

lowest :    18    66    70    78    86, highest: 12494 12600 14900 15700 26801
# Supplemental B12
nhanes01$supp_b12 <- nhanes01$DSQTVB12 
Hmisc::describe(nhanes01$supp_b12)
nhanes01$supp_b12 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    1826     3909      308    0.995    220.8    385.2     1.33     3.00 
     .25      .50      .75      .90      .95 
    6.00    16.67    50.00   603.75  1012.00 
                                                                      
Value          0   200   400   600   800  1000  1200  1400  1600  1800
Frequency   1504    88    31    23     9    98     6     4     4     2
Proportion 0.824 0.048 0.017 0.013 0.005 0.054 0.003 0.002 0.002 0.001
                                                                      
Value       2000  2200  2400  2600  4000  4200  5000  6800 10000 15200
Frequency     12     1    15    12     1     1    10     1     3     1
Proportion 0.007 0.001 0.008 0.007 0.001 0.001 0.005 0.001 0.002 0.001
# Dietary B12
nhanes01$diet_b12 <- ((nhanes01$DR1TVB12+nhanes01$DR2TVB12)/2)
Hmisc::describe(nhanes01$diet_b12)
nhanes01$diet_b12 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    4594     1141     2286        1    4.797    3.623    1.088    1.480 
     .25      .50      .75      .90      .95 
   2.445    3.845    5.940    8.867   11.620 

lowest :  0.010  0.015  0.025  0.055  0.080, highest: 39.920 40.830 48.370 52.355 54.415
## Managing categorical variables
# BMI
nhanes01$bmi_cat <- Hmisc::cut2(nhanes01$bmi, cuts = c(18.5, 25, 30))
Hmisc::describe(nhanes01$bmi_cat)
nhanes01$bmi_cat 
       n  missing distinct 
    5660       75        4 
                                                          
Value      [14.1,18.5) [18.5,25.0) [25.0,30.0) [30.0,82.9]
Frequency          103        1635        1801        2121
Proportion       0.018       0.289       0.318       0.375
nhanes01$bmi_cat2 <- nhanes01$bmi_cat
levels(nhanes01$bmi_cat2)
[1] "[14.1,18.5)" "[18.5,25.0)" "[25.0,30.0)" "[30.0,82.9]"
nhanes01$bmi_cat2 <- fct_recode(nhanes01$bmi_cat,
                                  "Underweight" = "[14.1,18.5)",
                                  "Normal" = "[18.5,25.0)",
                                  "Overweight" = "[25.0,30.0)",
                                  "Obese" = "[30.0,82.9]")
table(nhanes01$bmi_cat, nhanes01$bmi_cat2)
             
              Underweight Normal Overweight Obese
  [14.1,18.5)         103      0          0     0
  [18.5,25.0)           0   1635          0     0
  [25.0,30.0)           0      0       1801     0
  [30.0,82.9]           0      0          0  2121
nhanes01$bmi_cat <- nhanes01$bmi_cat2

# Smoking
nhanes01$smk100 <- ifelse(nhanes01$SMQ020==1, 1, 0)
table(nhanes01$smk100, nhanes01$SMQ020)
   
       1    2
  0    0 3266
  1 2467    0
# Alcohol
nhanes01$etoh <- ifelse(nhanes01$ALQ101==1,1,ifelse(nhanes01$ALQ101==9,NA,0))
table(nhanes01$etoh, nhanes01$ALQ101)
   
       1    2    9
  0    0 1505    0
  1 3729    0    0
# Antacid use
nhanes01$antacid <- ifelse(nhanes01$DSD010AN==1,1,ifelse(nhanes01$DSD010AN==7,NA, ifelse(nhanes01$DSD010AN==9,NA,0)))
table(nhanes01$antacid, nhanes01$DSD010AN)
   
       1    2    7    9
  0    0 5167    0    0
  1  566    0    0    0
# Creating indicator variable for vitamin B12 deficiency (vit_b12<300)
nhanes01$b12def <- ifelse(nhanes01$vit_b12<300,1,0)

2.2.1 Subset Columns

Given the sample size requirements, I selected complete cases for the variables of interest, and subsetted by age greater than 40, with the final n = 1000. I created a tidy csv file with the variables of interest for export, which is then imported for further analysis below.

# Selecting variables of interest
nhanes02 <- nhanes01 %>%
 dplyr::select(id, age, vit_b12, b12def, smk100, bmi_cat, etoh, antacid, supp_b12, diet_b12)

# Selecting complete cases for sample size requirements
b12_final <- nhanes02 %>% 
  filter(complete.cases(.)) %>%
  filter(age > 40)
tbl_df(b12_final)
# A tibble: 1,000 x 10
   id      age vit_b12 b12def smk100 bmi_cat     etoh antacid supp_b12
   <chr> <int>   <int>  <dbl>  <dbl> <fct>      <dbl>   <dbl>    <dbl>
 1 73559    72     732      0   1.00 Overweight  1.00    0       18.0 
 2 73564    61     668      0   0    Obese       0       1.00    40.0 
 3 73610    43     548      0   0    Overweight  1.00    0       18.0 
 4 73613    60     538      0   0    Obese       1.00    0       16.7 
 5 73616    62     318      0   1.00 Normal      1.00    0       16.7 
 6 73622    72     378      0   0    Overweight  1.00    0        4.00
 7 73628    80    4560      0   1.00 Normal      1.00    1.00  2500   
 8 73642    57    1040      0   1.00 Obese       0       1.00  1010   
 9 73643    62     904      0   1.00 Overweight  1.00    0       25.0 
10 73659    70     697      0   0    Obese       1.00    0     1000   
# ... with 990 more rows, and 1 more variable: diet_b12 <dbl>
# Evaluating values of outcome in new dataset
Hmisc::describe(b12_final$vit_b12)
b12_final$vit_b12 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    1000        0      626        1    875.1    641.2    304.9    362.9 
     .25      .50      .75      .90      .95 
   481.0    669.0    933.2   1344.6   2100.2 

lowest :   154   157   160   173   174, highest:  7580  7750  9317 12099 12494
table(b12_final$b12def)

  0   1 
952  48 
# Write tidy csv file
write.csv(b12_final,"Baldassari-tidy.csv", row.names=FALSE,na="NA")
# Read back in tidy csv file to start from tidy dataset
b12_final <- read.csv("Baldassari-tidy.csv",na.strings="NA")

2.3 Are there missing values?

Below, the complete case analysis (lack of missing values) is confirmed.

na.pattern(b12_final)
pattern
0000000000 
      1000 

3 Task 3: Listing of My Tibble

glimpse(b12_final)
Observations: 1,000
Variables: 10
$ id       <int> 73559, 73564, 73610, 73613, 73616, 73622, 73628, 7364...
$ age      <int> 72, 61, 43, 60, 62, 72, 80, 57, 62, 70, 59, 70, 51, 6...
$ vit_b12  <int> 732, 668, 548, 538, 318, 378, 4560, 1040, 904, 697, 5...
$ b12def   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ smk100   <int> 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,...
$ bmi_cat  <fct> Overweight, Obese, Overweight, Obese, Normal, Overwei...
$ etoh     <int> 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,...
$ antacid  <int> 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,...
$ supp_b12 <dbl> 18.00, 40.00, 18.00, 16.67, 16.67, 4.00, 2500.00, 100...
$ diet_b12 <dbl> 4.370, 8.600, 3.690, 5.470, 2.305, 4.360, 3.530, 2.48...
str(b12_final)
'data.frame':   1000 obs. of  10 variables:
 $ id      : int  73559 73564 73610 73613 73616 73622 73628 73642 73643 73659 ...
 $ age     : int  72 61 43 60 62 72 80 57 62 70 ...
 $ vit_b12 : int  732 668 548 538 318 378 4560 1040 904 697 ...
 $ b12def  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ smk100  : int  1 0 0 0 1 0 1 1 1 0 ...
 $ bmi_cat : Factor w/ 4 levels "Normal","Obese",..: 3 2 3 2 1 3 1 2 3 2 ...
 $ etoh    : int  1 0 1 1 1 1 1 0 1 1 ...
 $ antacid : int  0 1 0 0 0 0 1 1 0 0 ...
 $ supp_b12: num  18 40 18 16.7 16.7 ...
 $ diet_b12: num  4.37 8.6 3.69 5.47 2.31 ...

My tibble has 1000 subject/rows and 10 variables/columns.

4 Task 4: Code Book

Below is the codebook for my Project 1.

attach(b12_final)
Variable Class Description Range or Levels NA
id integer patient identification code Range: 73559, 83721 -
age integer age (years) Range: 41, 80 -
vit_b12 integer Vitamin B12 level (pg/ml) Range: 154, 12494 -
b12def integer B12 deficiency (B12 level <300 pg/ml) (1 = yes, 0 = no) 48 (4.8%) b12def -
smk100 integer Smoked 100 cigarettes in lifetime (1 = yes, 0 = no) 450 (45%) smk100 -
bmi_cat factor BMI category (Underweight, Normal, Overweight, or Obese) 13 (1.3%) Underweight, 254 (25.4%) Normal, 361 (36.1%) Overweight, 372 (37.2%) Obese -
etoh integer At least 12 alcoholic beverages in the past year (1 = yes, 0 = no) NA (NA%) etoh -
antacid integer Antacid use in the past 30 days (1 = yes, 0 = no) 142 (14.2%) antacid -
supp_b12 numeric Supplemental Vitamin B12 level intake (mcg) Range: 0.16, 1.5110^{4} -
diet_b12 numeric Dietary Vitamin B12 level intake (mcg) Range: 0.16, 54.415 -
detach(b12_final)

As this was a complete case analysis, there are no NAs.

5 Task 5: My Subjects

The data include a sample of NHANES 2013-2014 data with 1000 subjects greater than the age of 40 who had complete information on the predictors and outcome of interest (vitamin B12 level).

6 Task 6: My Variables

There are 10 variables in the b12_final data set.

  1. id
    • This is a subject identification code, ranging from 73564 to 83721. It is a renamed version of SEQN from the original NHANES data.
  2. age
    • This is the patient’s age in years. Age is the renamed version of RIDAGEYR from the Demographics table of NHANES, and represents the subject’s age in years. All subjects in this dataset are greater than 40 years of age.
  3. vit_b12
    • This is the patient’s serum vitamin B12 level, in pg/ml. vit_b12 is the renamed version of LBDB12 from the original NHANES data (B12 Table).
  4. b12def
    • This is the binary indicator variable I created for Vitamin B12 deficiency, my binary outcome of interest. b12def = 1 if the patient has vitamin B12 deficiency (defined as vitamin B12 level <300), and is 0 if they do not.
  5. smk100
    • This variable indicates whether a subject has smoked at least 100 cigarettes in their lifetime (1= Yes, 0 = No). It is the renamed version of SMQ020 from the Smoking Questionnaire from NHANES data.
  6. bmi_cat
    • This variable indicates the subject’s BMI category, and can take 4 possible values: Underweight (BMI<18.5), Normal (BMI 18.5-25.0), Overweight (BMI 25.0-30.0), and Obese (BMI>30.0). It was derived from BMI data available via NHANES Body Measures Table, and the variable for BMI there was BMXBMI.
  7. etoh
    • This is a binary indicator variable for whether a subject reported drinking at least 12 alcoholic beverages in the past year (1 = Yes, 0 = No).
  8. antacid
    • This is a binary indicator variable for whether a subject reported taking an antacid medication in the past 30 days (1 = Yes, 0 = No).
  9. supp_b12
    • This is the amount of supplementary vitamin B12 a patient took in the 30 days prior to evaluation, in mcg.
  10. diet_b12
    • This is the daily mean amount of dietary vitamin B12 a patient ingested over the 2 days of dietary evaluation, in mcg.

7 Task 7: My Planned Linear Regression Model

For my planned linear regression model, I plan to use vitamin B12 level (pg/ml) as a quantitative variable, and my candidate predictors will be the following:

  • age
  • smk100
  • bmi_cat
  • etoh
  • antacid
  • supp_b12
  • diet_b12

The key predictor I will focus on here is smoking.

7.1 Spearman \(\rho^2\) Plot

A Spearman \(\rho^2\) plot indicates that supplemental B12 is the strongest predictor for vitamin B12 level, and alcohol and age are other variables of interest but with a much smaller predictive punch. Supplemental B12 is an important confounder of this relationship that will be important to account for in this analysis.

plot(spearman2(vit_b12 ~ age + smk100 + bmi_cat + etoh + antacid + supp_b12 + diet_b12, data = b12_final))

8 Task 8: My Planned Logistic Regression Model

For my planned logistic regression model, my binary outcome will be vitamin B12 deficiency (defined as serum vitamin B12 level < 300), represented by the variable b12def. My candidate predictors will be the same as above:

  • age
  • smk100
  • bmi_cat
  • etoh
  • antacid
  • supp_b12
  • diet_b12

The key predictor I will focus on here is smoking.

8.1 Spearman \(\rho^2\) Plot

A Spearman \(\rho^2\) plot indicates that supplemental and dietary B12, as well as smoking, are the strongest predictors for vitamin B12 deficiency. Supplemental and dietary B12 are important confounders of this relationship that will be important to account for in this analysis.

plot(spearman2(b12def ~ age + smk100 + bmi_cat + etoh + antacid + supp_b12 + diet_b12, data = b12_final))

9 Task 9: Affirmation

This data set meets all requirements specified in the project instructions.

  • The data set contains 1000 observations on 10 variables, which is within the limits of 100-1000 observations on 7-20 variables set in the assignment.
  • There are no missing values in the data.
  • I am considering at least four predictors for each regression model, and I include at least one quantitative (for example, diet_b12) and multi-categorical variable (for example, bmi_cat) in each model.
  • I (Laura Baldassari) am (is) certain that it is completely appropriate for these data to be shared with anyone, without any conditions. There are no concerns about privacy or security, mostly because the data have been on a public website for many years, and are completely free of identifying information about individual subjects.

10 Task 10: An Analysis: Linear Regression

10.1 Exploratory data analysis

For further exploratory analysis, I utilized skim and plotted a histogram, Boxplot, and normal QQ plot of the outcome (vitamin B12 level). These analyses revealed that the outcome is extremely right skewed due to outliers. I therefore evaluated for possible transformation via BoxCox analysis.

skim(b12_final)
Skim summary statistics
 n obs: 1000 
 n variables: 10 

Variable type: factor 
 variable missing complete    n n_unique
  bmi_cat       0     1000 1000        4
                            top_counts ordered
 Obe: 372, Ove: 361, Nor: 254, Und: 13   FALSE

Variable type: integer 
 variable missing complete    n      mean      sd    p0      p25 median
      age       0     1000 1000    61.01    11.87    41    51        61
  antacid       0     1000 1000     0.14     0.35     0     0         0
   b12def       0     1000 1000     0.048    0.21     0     0         0
     etoh       0     1000 1000     0.74     0.44     0     0         1
       id       0     1000 1000 78566.26  2937.18 73559 75995.25  78582
   smk100       0     1000 1000     0.45     0.5      0     0         0
  vit_b12       0     1000 1000   875.15   943.18   154   481       669
      p75  p100     hist
    70       80 ▆▅▆▇▇▆▅▇
     0        1 ▇▁▁▁▁▁▁▁
     0        1 ▇▁▁▁▁▁▁▁
     1        1 ▃▁▁▁▁▁▁▇
 81011.5  83721 ▇▇▇▇▇▇▇▇
     1        1 ▇▁▁▁▁▁▁▆
   933.25 12494 ▇▁▁▁▁▁▁▁

Variable type: numeric 
 variable missing complete    n   mean     sd   p0  p25 median   p75
 diet_b12       0     1000 1000   4.76   4.11 0.16 2.49    3.8  5.86
 supp_b12       0     1000 1000 256.76 811.31 0.16 6      25   75   
     p100     hist
    54.41 ▇▂▁▁▁▁▁▁
 15100    ▇▁▁▁▁▁▁▁
# Histogram of B12 level
ggplot(b12_final, aes(x= vit_b12)) +
  geom_histogram(aes(y=..density..)) + 
  stat_function(fun=dnorm, 
                args=list(mean=mean(b12_final$vit_b12), 
                          sd=sd(b12_final$vit_b12)),
                lwd = 1.5) +
  labs(title = "Histogram of B12 level", x="Vitamin B12 level", y="") +
  theme_bw()

# Boxplot of B12 level
ggplot(b12_final, aes(x = 1, y = vit_b12)) + 
    geom_boxplot(notch = TRUE) +
    labs(title = "Boxplot", x = "", y = "") +
    theme_bw() +
    theme(axis.text.x = element_blank(),
          axis.ticks.x = element_blank()) 

# Normal QQ plot of B12 level
qqnorm(b12_final$vit_b12, main = "Normal Q-Q plot for vitamin B12")
qqline(b12_final$vit_b12, col = "red")

# Box-Cox for B12 level

boxCox(lm(vit_b12 ~ diet_b12 + supp_b12 + bmi_cat + etoh + smk100 + age + antacid, data=b12_final))

powerTransform(lm(vit_b12 ~ diet_b12 + supp_b12 + bmi_cat + etoh + smk100 + age + antacid, data=b12_final))
Estimated transformation parameters 
        Y1 
-0.3007455 

The BoxCox analysis estimates a transformation parameter of -0.3, which is closest to -0.5. I will therefore transform the outcome (vitamin B12 level) per Tukey’s ladder to 1/sqrt(y).

# Transforming B12 to 1/sqrt(B12)
b12_final$b12_trans <- (1/(sqrt(b12_final$vit_b12)))

ggplot(b12_final, aes(x=vit_b12, y = b12_trans)) + geom_point()

# Histogram of transformed B12 level
ggplot(b12_final, aes(x= b12_trans)) +
  geom_histogram(aes(y=..density..)) + 
  stat_function(fun=dnorm, 
                args=list(mean=mean(b12_final$b12_trans), 
                          sd=sd(b12_final$b12_trans)),
                lwd = 1.5) +
  labs(title = "Histogram of transformed B12 level", x="1/sqrt(Vitamin B12 level)", y="") + theme_bw()

# Boxplot of B12 level
ggplot(b12_final, aes(x = 1, y = b12_trans)) + 
    geom_boxplot(notch = TRUE) +
    labs(title = "Boxplot", x = "", y = "") +
    theme_bw() +
    theme(axis.text.x = element_blank(),
          axis.ticks.x = element_blank()) 

# Normal QQ plot of B12 level
qqnorm(b12_final$b12_trans, main = "Normal Q-Q plot for transformed vitamin B12")
qqline(b12_final$b12_trans, col = "red")

The transformed outcome now appears to be approximately normally distributed on histogram, boxplot, and normal QQ plot, which is a marked improvement from its untransformed values.

After transforming the outcome, I reevaluated a Spearman \(\rho^2\) plot to determine which predictors were most promising for nonlinear terms. The plot again indicates that supplemental B12 is the strongest predictor for vitamin B12 level, and alcohol and age are other variables of interest but with a much smaller predictive punch. Therefore, I will use supplemental B12, alcohol, and age as considerations for a nonlinear term. Supplemental B12 is an important confounder that will be important to account for in this analysis.

plot(spearman2(b12_trans ~ age + smk100 + bmi_cat + etoh + antacid + supp_b12 + diet_b12, data = b12_final))

For this analysis, the approximate number of degrees of freedom is n/15 = 1000/15 = 66. Therefore, I will spend 3 degrees of freedom evaluating supplementary B12, alcohol, and age for possible non-linearity via scatterplots versus vitamin B12 level.

ggplot(data=b12_final, aes(x= supp_b12, y = b12_trans)) +
  geom_point() + 
  geom_smooth(method= "lm", se= FALSE) + 
  geom_smooth(method= "loess", se=FALSE)

ggplot(data=b12_final, aes(x= etoh, y = b12_trans)) +
  geom_point()+ 
  geom_smooth(method= "lm", se= FALSE)+ 
  geom_smooth(method= "loess", se=FALSE)

ggplot(data=b12_final, aes(x= age, y = b12_trans)) +
  geom_point()+ 
  geom_smooth(method= "lm", se= FALSE)+ 
  geom_smooth(method= "loess", se=FALSE)

10.2 Model A - kitchen sink with and without nonlinear term

Given the scatterplots above, I created a modified kitchen sink model with a restricted cubic spline with 4 knots for supplemental B12 intake. It does not appear appropriate to try to fit a nonlinear term to alcohol and age.

modelA_rcs <- lm(b12_trans ~ rcs(supp_b12,4) + age + smk100 + bmi_cat + etoh + antacid + diet_b12, data=b12_final)
summary(modelA_rcs)

Call:
lm(formula = b12_trans ~ rcs(supp_b12, 4) + age + smk100 + bmi_cat + 
    etoh + antacid + diet_b12, data = b12_final)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.033944 -0.006466 -0.000949  0.005646  0.040411 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 4.304e-02  1.922e-03  22.397  < 2e-16 ***
rcs(supp_b12, 4)supp_b12   -2.630e-04  6.538e-05  -4.022 6.21e-05 ***
rcs(supp_b12, 4)supp_b12'   2.495e-01  6.979e-02   3.575 0.000368 ***
rcs(supp_b12, 4)supp_b12'' -3.537e-01  9.906e-02  -3.571 0.000373 ***
age                        -1.003e-05  2.723e-05  -0.368 0.712664    
smk100                     -5.524e-04  6.597e-04  -0.837 0.402636    
bmi_catObese                1.610e-03  8.064e-04   1.996 0.046198 *  
bmi_catOverweight           8.171e-04  8.040e-04   1.016 0.309758    
bmi_catUnderweight         -1.487e-03  2.797e-03  -0.532 0.595032    
etoh                        2.492e-03  7.550e-04   3.301 0.000998 ***
antacid                    -6.535e-04  8.939e-04  -0.731 0.464918    
diet_b12                   -7.349e-05  7.578e-05  -0.970 0.332430    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.009799 on 988 degrees of freedom
Multiple R-squared:  0.1789,    Adjusted R-squared:  0.1698 
F-statistic: 19.57 on 11 and 988 DF,  p-value: < 2.2e-16
anova(modelA_rcs)
Analysis of Variance Table

Response: b12_trans
                  Df   Sum Sq   Mean Sq F value    Pr(>F)    
rcs(supp_b12, 4)   3 0.019122 0.0063741 66.3813 < 2.2e-16 ***
age                1 0.000095 0.0000953  0.9922  0.319456    
smk100             1 0.000005 0.0000051  0.0534  0.817239    
bmi_cat            3 0.000320 0.0001066  1.1097  0.344149    
etoh               1 0.000985 0.0009846 10.2534  0.001408 ** 
antacid            1 0.000053 0.0000533  0.5547  0.456589    
diet_b12           1 0.000090 0.0000903  0.9403  0.332430    
Residuals        988 0.094870 0.0000960                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Create OLS model to create nomogram
d <- datadist(b12_final)
options(datadist="d")

modelA_rcs_ols <- ols(b12_trans ~ rcs(supp_b12,4) + age + smk100 + bmi_cat + etoh + antacid + diet_b12, data = b12_final,x = TRUE, y = TRUE)

plot(nomogram(modelA_rcs_ols))

Based on the nonlinear model above’s nomogram, it is apparent that the restricted cubic spline supplemental B12 clearly has the strongest impact upon the transformed vitamin B12 level.

modelA <- lm(b12_trans ~ supp_b12 + age + smk100 + bmi_cat + etoh + antacid + diet_b12, data=b12_final)
summary(modelA)

Call:
lm(formula = b12_trans ~ supp_b12 + age + smk100 + bmi_cat + 
    etoh + antacid + diet_b12, data = b12_final)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.031024 -0.006603 -0.001004  0.005807  0.043818 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         4.198e-02  1.973e-03  21.277  < 2e-16 ***
supp_b12           -4.037e-06  3.995e-07 -10.106  < 2e-16 ***
age                -5.064e-05  2.764e-05  -1.832  0.06729 .  
smk100             -9.431e-04  6.838e-04  -1.379  0.16814    
bmi_catObese        1.177e-03  8.368e-04   1.406  0.15998    
bmi_catOverweight   5.953e-04  8.355e-04   0.712  0.47633    
bmi_catUnderweight -2.150e-03  2.906e-03  -0.740  0.45957    
etoh                2.457e-03  7.839e-04   3.135  0.00177 ** 
antacid            -6.838e-04  9.283e-04  -0.737  0.46148    
diet_b12           -7.467e-05  7.868e-05  -0.949  0.34282    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01019 on 990 degrees of freedom
Multiple R-squared:  0.1104,    Adjusted R-squared:  0.1023 
F-statistic: 13.65 on 9 and 990 DF,  p-value: < 2.2e-16
anova(modelA)
Analysis of Variance Table

Response: b12_trans
           Df   Sum Sq   Mean Sq  F value    Pr(>F)    
supp_b12    1 0.010799 0.0107986 104.0122 < 2.2e-16 ***
age         1 0.000632 0.0006319   6.0868  0.013788 *  
smk100      1 0.000017 0.0000173   0.1663  0.683494    
bmi_cat     3 0.000204 0.0000681   0.6561  0.579199    
etoh        1 0.000954 0.0009541   9.1902  0.002496 ** 
antacid     1 0.000059 0.0000589   0.5669  0.451667    
diet_b12    1 0.000094 0.0000935   0.9007  0.342824    
Residuals 990 0.102782 0.0001038                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the model with a restricted cubic spline (RCS), all knots for supplemental B12 RCS and alcohol appear to have statistically significant predictive value, per ANOVA. BMI category of obese also has some marginal statistically significant predictive value in the linear model, but BMI as a term does not with ANOVA testing.

However, without the restricted cubic spline, the terms for supplemental B12, age, and alcohol appear to have statistically significant predictive value.

Further model comparison was performed via ANOVA nested model comparison and glance to compare AIC/BIC.

# Nested model comparison by ANOVA
anova(modelA, modelA_rcs)
Analysis of Variance Table

Model 1: b12_trans ~ supp_b12 + age + smk100 + bmi_cat + etoh + antacid + 
    diet_b12
Model 2: b12_trans ~ rcs(supp_b12, 4) + age + smk100 + bmi_cat + etoh + 
    antacid + diet_b12
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1    990 0.10278                                  
2    988 0.09487  2 0.0079119 41.198 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Glance
glance(modelA_rcs)
  r.squared adj.r.squared       sigma statistic      p.value df   logLik
1 0.1789021     0.1697603 0.009799119  19.56972 6.025097e-36 12 3212.561
        AIC      BIC   deviance df.residual
1 -6399.121 -6335.32 0.09487045         988
glance(modelA)
  r.squared adj.r.squared      sigma statistic      p.value df  logLik
1 0.1104255     0.1023384 0.01018924  13.65462 7.822512e-21 10 3172.51
       AIC       BIC  deviance df.residual
1 -6323.02 -6269.035 0.1027823         990

The ANOVA testing for nested models indicates that there is significant predictive value offered by the restricted cubic spline term, as p < 0.05 with the addition of the term to the model.

Model R2 Adjusted R2 AIC BIC degrees of freedom
A with RCS 0.179 0.170 -6399.1 -6335.3 12
A without RCS 0.110 0.102 -6323.0 -6269.0 10

Based on its higher multiple/adjusted R and lower AIC/BIC, despite use of 2 additional degrees of freedom, the kitchen sink model with the restricted cubic spline outperforms the kitchen sink model without it.

I will proceed with best subsets analysis to further refine the search for the best predictors, keeping model A with a restricted cubic spline for later comparison.

10.3 Running “Best Subsets” to select predictors

Since smoking is my primary predictor of interest, I ran best subsets regression with forcing my smoking variable (smk100) into each model.

library(leaps)
preds<- with(b12_final, cbind(supp_b12, age, smk100, etoh, antacid, diet_b12))

x1 <- regsubsets(preds, y=b12_final$b12_trans, data=b12_final, force.in=c("smk100"))
rs <- summary(x1)
rs
Subset selection object
6 Variables  (and intercept)
         Forced in Forced out
smk100       FALSE      FALSE
supp_b12     FALSE      FALSE
age           TRUE      FALSE
etoh         FALSE      FALSE
antacid      FALSE      FALSE
diet_b12     FALSE      FALSE
1 subsets of each size up to 6
Selection Algorithm: exhaustive
         smk100 supp_b12 age etoh antacid diet_b12
2  ( 1 ) "*"    "*"      " " " "  " "     " "     
3  ( 1 ) "*"    "*"      " " "*"  " "     " "     
4  ( 1 ) "*"    "*"      "*" "*"  " "     " "     
5  ( 1 ) "*"    "*"      "*" "*"  " "     "*"     
6  ( 1 ) "*"    "*"      "*" "*"  "*"     "*"     
# Summarizing "winners"
winners <- tbl_df(rs$which)
winners$k <- 3:7
winners$r2 <- rs$rsq
winners$adjr2 <- rs$adjr2
winners$cp <- rs$cp
winners$bic <- rs$bic

winners
# A tibble: 5 x 12
  `(Intercept)` smk100 supp_b12 age   etoh  antacid diet_b12     k     r2
  <lgl>         <lgl>  <lgl>    <lgl> <lgl> <lgl>   <lgl>    <int>  <dbl>
1 T             T      T        F     F     F       F            3 0.0937
2 T             T      T        F     T     F       F            4 0.103 
3 T             T      T        T     T     F       F            5 0.107 
4 T             T      T        T     T     F       T            6 0.107 
5 T             T      T        T     T     T       T            7 0.108 
# ... with 3 more variables: adjr2 <dbl>, cp <dbl>, bic <dbl>

10.3.1 Building the Four Summary Plots

# Note: 1000 subjects, models for 1-5 predictors in addition to smoking (3-7 inputs) 

# Adjusted R2 plot
p1 <- ggplot(winners, aes(x = k, y = adjr2, 
                       label = round(adjr2,2))) +
    geom_line() +
    geom_label() +
    geom_label(data = subset(winners, 
                             adjr2 == max(adjr2)),
               aes(x = k, y = adjr2, label = round(adjr2,2)), 
               fill = "yellow", col = "blue") +
    theme_bw() +
    scale_x_continuous(breaks = 3:7) +
    labs(x = "# of predictors (including intercept)",
         y = "Adjusted R-squared")

# Cp plot
p2 <- ggplot(winners, aes(x = k, y = cp, 
                             label = round(cp,1))) +
    geom_line() +
    geom_label() +
    geom_abline(intercept = 0, slope = 1, 
                col = "red") +
    theme_bw() +
    scale_x_continuous(breaks = 3:7) +
    labs(x = "# of predictors (including intercept)",
         y = "Mallows' Cp")

# Calculating AIC.corr
temp.n <- nrow(b12_final)
temp.inputs <- 6
    
rs$aic.corr <- temp.n*log(rs$rss / temp.n) + 
    2*(2:temp.inputs) +
    (2 * (2:temp.inputs) * ((2:temp.inputs)+1) / 
         (temp.n - (2:temp.inputs) - 1))

winners$aic.corr <- rs$aic.corr

# Plot AIC.corr
p3 <- ggplot(winners, aes(x = k, y = rs$aic.corr, 
                             label = round(aic.corr,1))) +
    geom_line() +
    geom_label() +
    geom_label(data = subset(winners, 
                             aic.corr == min(aic.corr)),
               aes(x = k, y = aic.corr), 
               fill = "pink", col = "red") +
    theme_bw() +
    scale_x_continuous(breaks = 3:7) +
    labs(x = "# of predictors (including intercept)",
         y = "Bias-Corrected AIC")

# BIC plot - BIC is minimized by k = 2
p4 <- ggplot(winners, aes(x = k, y = bic, 
                             label = round(bic,1))) +
    geom_line() +
    geom_label() +
    geom_label(data = subset(winners, bic == min(bic)),
               aes(x = k, y = bic), 
               fill = "lightgreen", col = "blue") +
    theme_bw() +
    scale_x_continuous(breaks = 3:7) +
    labs(x = "# of predictors (including intercept)",
         y = "BIC")

# All four plots together
gridExtra::grid.arrange(p1, p2, p3, p4, nrow = 2)

10.3.2 Candidate Models from “Best Subsets”

Tool Inputs Predictors
BIC 4 smk100, supp_b12, etoh
AICc, Cp, R2 (adj.) 5 smk100, supp_b12, etoh, age

10.3.3 Comparison of Candidate Models from Best Subsets via ANOVA

lm4 <- lm(b12_trans ~ smk100 + supp_b12 + etoh, data=b12_final)
lm5 <- lm(b12_trans ~ smk100 + supp_b12 + etoh + age, data=b12_final)
anova(lm4,lm5)
Analysis of Variance Table

Model 1: b12_trans ~ smk100 + supp_b12 + etoh
Model 2: b12_trans ~ smk100 + supp_b12 + etoh + age
  Res.Df     RSS Df  Sum of Sq      F  Pr(>F)  
1    996 0.10363                               
2    995 0.10322  1 0.00041074 3.9594 0.04688 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

10.4 Comparing model from best subsets to model A with RCS

Based on this ANOVA result where it appears that age results in a (barely) statistically significant increase in predictive value, I will use the linear model with 5 inputs as model B, to compare to model A above with the restricted cubic spline for supplementary B12 intake.

modelB <- lm(b12_trans ~ smk100 + supp_b12 + etoh + age, data=b12_final)
glance(modelB)
  r.squared adj.r.squared      sigma statistic      p.value df  logLik
1 0.1066449     0.1030535 0.01018518   29.6947 2.329966e-23  5 3170.39
        AIC       BIC  deviance df.residual
1 -6328.779 -6299.333 0.1032191         995
glance(modelA_rcs)
  r.squared adj.r.squared       sigma statistic      p.value df   logLik
1 0.1789021     0.1697603 0.009799119  19.56972 6.025097e-36 12 3212.561
        AIC      BIC   deviance df.residual
1 -6399.121 -6335.32 0.09487045         988
Model R2 Adjusted R2 AIC BIC degrees of freedom
A with RCS 0.179 0.170 -6399.1 -6335.3 12
B 0.107 0.103 -6328.8 -6299.3 5

I then proceeded with cross-validation of each model to evaluate their performances further.

# Model A (with restricted cubic spline)
set.seed(432)

b12_modelA <- b12_final %>%
    crossv_kfold(k = 10) %>%
    mutate(model = map(train, ~ lm(b12_trans ~ rcs(supp_b12, 4) + age + smk100 + bmi_cat + etoh + antacid + diet_b12, data = .)))

modelA_predictions <- b12_modelA  %>%
    unnest(map2(model, test, ~ augment(.x, newdata = .y)))

head(modelA_predictions)
# A tibble: 6 x 14
  .id      id   age vit_b12 b12def smk100 bmi_cat    etoh antacid supp_b12
  <chr> <int> <int>   <int>  <int>  <int> <fct>     <int>   <int>    <dbl>
1 01    73564    61     668      0      0 Obese         0       1    40.0 
2 01    73622    72     378      0      0 Overweig…     1       0     4.00
3 01    73690    77     818      0      0 Normal        0       0     2.50
4 01    73708    69     636      0      1 Overweig…     1       1    15.0 
5 01    73720    48     565      0      0 Obese         1       0     6.00
6 01    73820    56     484      0      0 Overweig…     1       0     6.00
# ... with 4 more variables: diet_b12 <dbl>, b12_trans <dbl>,
#   .fitted <dbl>, .se.fit <dbl>
modelA_predictions %>%
    summarize(RMSE_modelA = sqrt(mean((b12_trans - .fitted) ^2)),
              MAE_modelA = mean(abs(b12_trans - .fitted)))
# A tibble: 1 x 2
  RMSE_modelA MAE_modelA
        <dbl>      <dbl>
1        1.90      0.490
# Graph cross validated prediction errors for model 2
modelA_predictions %>%
    mutate(errors = b12_trans - .fitted) %>%
    ggplot(., aes(x = errors)) +
    geom_histogram(bins = 30, fill = "darkviolet", col = "yellow") + 
    labs(title = "Cross-Validated Errors in Prediction of 1/sqrt(vitamin B12)",
         subtitle = "Using model A",
         x = "Error in predicting transformed vitamin B12")

# Model B (from Best subsets)
set.seed(432)

b12_modelB <- b12_final %>%
    crossv_kfold(k = 10) %>%
    mutate(model = map(train, ~ lm(b12_trans ~ smk100 + supp_b12 + etoh + age, data = .)))

modelB_predictions <- b12_modelB  %>%
    unnest(map2(model, test, ~ augment(.x, newdata = .y)))

head(modelB_predictions)
# A tibble: 6 x 14
  .id      id   age vit_b12 b12def smk100 bmi_cat    etoh antacid supp_b12
  <chr> <int> <int>   <int>  <int>  <int> <fct>     <int>   <int>    <dbl>
1 01    73564    61     668      0      0 Obese         0       1    40.0 
2 01    73622    72     378      0      0 Overweig…     1       0     4.00
3 01    73690    77     818      0      0 Normal        0       0     2.50
4 01    73708    69     636      0      1 Overweig…     1       1    15.0 
5 01    73720    48     565      0      0 Obese         1       0     6.00
6 01    73820    56     484      0      0 Overweig…     1       0     6.00
# ... with 4 more variables: diet_b12 <dbl>, b12_trans <dbl>,
#   .fitted <dbl>, .se.fit <dbl>
modelB_predictions %>%
    summarize(RMSE_modelB = sqrt(mean((b12_trans - .fitted) ^2)),
              MAE_modelB = mean(abs(b12_trans - .fitted)))
# A tibble: 1 x 2
  RMSE_modelB MAE_modelB
        <dbl>      <dbl>
1      0.0103    0.00779
# Graph cross validated prediction errors for model 2
modelB_predictions %>%
    mutate(errors = b12_trans - .fitted) %>%
    ggplot(., aes(x = errors)) +
    geom_histogram(bins = 30, fill = "darkviolet", col = "yellow") + 
    labs(title = "Cross-Validated Errors in Prediction of 1/sqrt(vitamin B12)",
         subtitle = "Using model B",
         x = "Error in predicting transformed vitamin B12")

10-fold Cross-validation of models A and B revealed that Model B had lower root mean squared error and mean absolute error, as described in the table below:

Model R2 Adjusted R2 AIC BIC df RMSE MAE
A with RCS 0.179 0.170 -6399.1 -6335.3 12 1.90 0.49
B 0.107 0.103 -6328.8 -6299.3 5 0.01 0.01

Given the fact that model A performed better in sample but worse upon cross validation, with using several more degrees of freedom, this was concerning for overfitting the data. Model B had much smaller RMSE and MAE compared to Model A. Therefore, I chose model B from the best subsets analysis as my final model.

10.5 Validation and calibration of final model

Next, to further validate my chosen model, I created an OLS model to calculate validated summary statistics and calibrate via residual analysis. The validated R2 statistic is 0.085, which is decreased from an already poor 0.107. The mean squared error is again excellent at 0.0001 upon validation.

ANOVA comparisons demonstrate that supplemental B12, alcohol, and age have statistically significant predictive value for the transformed B12 level. Smoking, which is the primary outcome of interest, does not appear to have a statistically significant effect on vitamin B12 level once the other covariates have been adjusted for. These findings are also supported by ANOVA plots and summary plots for the OLS model B.

summary(modelB)

Call:
lm(formula = b12_trans ~ smk100 + supp_b12 + etoh + age, data = b12_final)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.030939 -0.006692 -0.000986  0.005776  0.043631 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.252e-02  1.847e-03  23.021  < 2e-16 ***
smk100      -8.672e-04  6.809e-04  -1.274  0.20307    
supp_b12    -3.996e-06  3.979e-07 -10.043  < 2e-16 ***
etoh         2.245e-03  7.733e-04   2.903  0.00378 ** 
age         -5.479e-05  2.754e-05  -1.990  0.04688 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01019 on 995 degrees of freedom
Multiple R-squared:  0.1066,    Adjusted R-squared:  0.1031 
F-statistic: 29.69 on 4 and 995 DF,  p-value: < 2.2e-16
# Creation of OLS model to calculate validated summary statistics
d <- datadist(b12_final)
options(datadist="d")

modelB_ols <- ols(b12_trans ~ smk100 + supp_b12 + etoh + age, data = b12_final,x = TRUE, y = TRUE)
modelB_ols
Linear Regression Model
 
 ols(formula = b12_trans ~ smk100 + supp_b12 + etoh + age, data = b12_final, 
     x = TRUE, y = TRUE)
 
                Model Likelihood     Discrimination    
                   Ratio Test           Indexes        
 Obs    1000    LR chi2    112.77    R2       0.107    
 sigma0.0102    d.f.            4    R2 adj   0.103    
 d.f.    995    Pr(> chi2) 0.0000    g        0.003    
 
 Residuals
 
        Min         1Q     Median         3Q        Max 
 -0.0309389 -0.0066916 -0.0009861  0.0057756  0.0436311 
 
 
           Coef    S.E.   t      Pr(>|t|)
 Intercept  0.0425 0.0018  23.02 <0.0001 
 smk100    -0.0009 0.0007  -1.27 0.2031  
 supp_b12   0.0000 0.0000 -10.04 <0.0001 
 etoh       0.0022 0.0008   2.90 0.0038  
 age       -0.0001 0.0000  -1.99 0.0469  
 
# Validated summary statistics
set.seed(432123)
validate(modelB_ols)
          index.orig training   test optimism index.corrected  n
R-square      0.1066   0.1162 0.0949   0.0213          0.0853 40
MSE           0.0001   0.0001 0.0001   0.0000          0.0001 40
g             0.0028   0.0029 0.0028   0.0002          0.0026 40
Intercept     0.0000   0.0000 0.0007  -0.0007          0.0007 40
Slope         1.0000   1.0000 0.9830   0.0170          0.9830 40
# Plotting of coefficient effects
anova(modelB_ols)
                Analysis of Variance          Response: b12_trans 

 Factor     d.f. Partial SS   MS           F      P     
 smk100       1  0.0001682933 0.0001682933   1.62 0.2031
 supp_b12     1  0.0104637741 0.0104637741 100.87 <.0001
 etoh         1  0.0008740486 0.0008740486   8.43 0.0038
 age          1  0.0004107388 0.0004107388   3.96 0.0469
 REGRESSION   4  0.0123218522 0.0030804631  29.69 <.0001
 ERROR      995  0.1032191204 0.0001037378              
plot(anova(modelB_ols))

10.5.1 Calibration via evaluation of residual plots

The calibration via evaluation of residual versus fitted values plot below demonstrates that the there are issues with outliers in the model, particularly observation #578. The values are clustered around the mean transformed B12 value of 0.04, and appear to take a “fuzzy football” shape, but the outlier actually has a negative transformed B12 level, which is not a realistic/feasible value. The Residuals versus leverage plot indicates that this observation has both high leverage and significantly increased Cook’s distance.

# Looking at residual plots for calibration of model
plot(modelB, which=1)

plot(modelB, which=2)

plot(modelB, which=3)

plot(modelB, which=5)

10.5.2 Investigation of outlier, with sensitivity analysis

I further investigated observation #578, and it appears that this subject has an extremely high level of supplemental B12 (15100 mcg), which is likely the cause of his or her low predicted transformed B12 level in this model. I therefore dropped observation 578 and reran the model to perform a sensitivity analysis:

modelB.no578 <- lm(b12_trans ~ smk100 + supp_b12 + etoh + age, data=b12_final[-578,])
summary(modelB.no578)

Call:
lm(formula = b12_trans ~ smk100 + supp_b12 + etoh + age, data = b12_final[-578, 
    ])

Residuals:
      Min        1Q    Median        3Q       Max 
-0.031330 -0.006485 -0.001148  0.005849  0.040683 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.294e-02  1.824e-03  23.545   <2e-16 ***
smk100      -7.309e-04  6.721e-04  -1.087   0.2771    
supp_b12    -5.483e-06  4.814e-07 -11.391   <2e-16 ***
etoh         2.067e-03  7.635e-04   2.707   0.0069 ** 
age         -5.535e-05  2.716e-05  -2.037   0.0419 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01005 on 994 degrees of freedom
Multiple R-squared:  0.1298,    Adjusted R-squared:  0.1263 
F-statistic: 37.05 on 4 and 994 DF,  p-value: < 2.2e-16
# Reexamine residual plots
plot(modelB.no578, which=1)

plot(modelB.no578, which=2)

plot(modelB.no578, which=3)

plot(modelB.no578, which=5)

The new model without observation 578 appears much more reasonable and does not seriously violate any assumptions of linear regression. There are no longer issues with high leverage/influence points or non-constant variance. I will therefore present my final model with validated summary statistics without observation 578.

10.6 Final Model Summary

# Model with lm
summary(modelB.no578)

Call:
lm(formula = b12_trans ~ smk100 + supp_b12 + etoh + age, data = b12_final[-578, 
    ])

Residuals:
      Min        1Q    Median        3Q       Max 
-0.031330 -0.006485 -0.001148  0.005849  0.040683 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.294e-02  1.824e-03  23.545   <2e-16 ***
smk100      -7.309e-04  6.721e-04  -1.087   0.2771    
supp_b12    -5.483e-06  4.814e-07 -11.391   <2e-16 ***
etoh         2.067e-03  7.635e-04   2.707   0.0069 ** 
age         -5.535e-05  2.716e-05  -2.037   0.0419 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01005 on 994 degrees of freedom
Multiple R-squared:  0.1298,    Adjusted R-squared:  0.1263 
F-statistic: 37.05 on 4 and 994 DF,  p-value: < 2.2e-16
confint(modelB.no578)
                    2.5 %        97.5 %
(Intercept)  3.936036e-02  4.651790e-02
smk100      -2.049928e-03  5.880288e-04
supp_b12    -6.427669e-06 -4.538454e-06
etoh         5.688311e-04  3.565407e-03
age         -1.086522e-04 -2.039787e-06
coef(modelB.no578)
  (Intercept)        smk100      supp_b12          etoh           age 
 4.293913e-02 -7.309498e-04 -5.483061e-06  2.067119e-03 -5.534598e-05 
# Backtransforming to original B12 scale
1/(coef(modelB.no578)^2)
 (Intercept)       smk100     supp_b12         etoh          age 
5.423673e+02 1.871651e+06 3.326242e+10 2.340287e+05 3.264584e+08 
# Creation of OLS model without #578 to demonstrate nomogram and effect sizes
d <- datadist(b12_final)
options(datadist="d")
modelB.no578.ols <- ols(b12_trans ~ smk100 + supp_b12 + etoh + age, data=b12_final[-578,])

The final model (using lm) is the following:

1/sqrt(vitamin B12 level) = 4.29e-02 - 7.31e-04(smk100) - 5.48e-06(supp_b12) + 2.07e-03(etoh) - 5.54e-05(age)

The model has an adjusted R2 of 0.1263, indicating that 12.6% of the variance in transformed B12 level is explained by the model. The ANOVA F-test p-value is <2.2e-16, indicating that the model has statistically significant predictive value.

The model can be interpreted as follows (each statement explains coefficient assuming other predictors are held constant, significance level = 0.05):

  • Those who smoked at least 100 cigarettes in their life, on average had a non-significant decrease in transformed vitamin B12 level of 7.31e-04 pg/ml (95% CI for decrease 2.05e-03 to -5.88e-04), compared to those who did not.

  • A 1 mcg increase in supplemental B12 intake resulted in a statistically significant decrease in transformed vitamin B12 level of 5.48e-06 pg/ml (95% CI for decrease 4.54e-06 to 6.43e-06).

  • Those who had at least 12 alcoholic beverages in the past year, on average had a statistically significant increase in transformed vitamin B12 level of 2.07e-03 pg/ml (95% CI 3.57e-03 to 5.69e-04)

  • A 1 year increase in age resulted in a statistically significant decrease in transformed vitamin B12 level of 1.09e-04 pg/ml (95% CI for decrease 1.09e-04 to 2.04e-06).

10.7 Predictions using final model

Next, I attempted to use my model to predict transformed vitamin B12 level for potential new subjects.I will use the model to create predictions where supplemental B12 can vary from 0 to 5000 and smoking can be either 0 or 1, with etoh = 0 and age = 61. The results are plotted below. It appears that at higher levels of supplemental B12, the standard errors for the model predictions are slightly higher.

# Tibble to demonstrate predictions
predictions <- Predict(modelB.no578.ols, supp_b12 = seq(0,5000), smk100 = c(0,1))
tbl_df(predictions)
# A tibble: 10,002 x 7
   smk100 supp_b12  etoh   age   yhat  lower  upper
    <dbl>    <int> <int> <dbl>  <dbl>  <dbl>  <dbl>
 1   0           0     0  61.0 0.0396 0.0383 0.0408
 2   1.00        0     0  61.0 0.0388 0.0372 0.0405
 3   0           1     0  61.0 0.0396 0.0383 0.0408
 4   1.00        1     0  61.0 0.0388 0.0372 0.0405
 5   0           2     0  61.0 0.0396 0.0383 0.0408
 6   1.00        2     0  61.0 0.0388 0.0372 0.0405
 7   0           3     0  61.0 0.0395 0.0383 0.0408
 8   1.00        3     0  61.0 0.0388 0.0372 0.0405
 9   0           4     0  61.0 0.0395 0.0383 0.0408
10   1.00        4     0  61.0 0.0388 0.0372 0.0404
# ... with 9,992 more rows
# Creating graph
ggplot(Predict(modelB.no578.ols, supp_b12 = 0:5000, smk100 = c(0,1)))

# Augment to calculate fitted values and backtransform to original B12 scale
b12_aug <- augment(modelB.no578) 
b12_aug$fit_trans <- (1/(b12_aug$.fitted^2))
b12_aug$orig_b12 <- (1/(b12_aug$b12_trans^2))

I then used a nomogram to accomplish predictions for an individual subject with smk100 = 1, supp_b12 = 2500, etoh = 0, and age = 45:

smk100 = 1 –> 0 points

supp_b12 = 2500 –> 72 points

etoh = 0 —> 0 points

age = 45 —> 2 points (nomogram too small to tell exactly here)

total points = 74, which leads to a predicted transformed B12 level of 0.016, which translates to a B12 level of 3906.3 pg/ml.

plot(nomogram(modelB.no578.ols))

11 Task 11: An Analysis: Logistic Regression

11.1 Exploratory Data Analysis

A Spearman \(\rho^2\) plot indicates that supplemental and dietary B12, as well as smoking, are the strongest predictors for vitamin B12 deficiency. Supplemental and dietary B12 are important confounders of this relationship that will be important to account for in this analysis.

plot(spearman2(b12def ~ age + smk100 + bmi_cat + etoh + antacid + supp_b12 + diet_b12, data = b12_final))

11.2 Model 1 - evaluating nonlinear terms

I will begin with model 1, a logistic regression model using LRM that will include the following nonlinear terms, in addition to the main effects of the other possible predictors.

  • Restricted cubic splines with 3 knots for supp_b12
  • Restricted cubic splines with 3 knots for diet_b12
  • An interaction term for smoking and BMI
skim(b12_final)
Skim summary statistics
 n obs: 1000 
 n variables: 11 

Variable type: factor 
 variable missing complete    n n_unique
  bmi_cat       0     1000 1000        4
                            top_counts ordered
 Obe: 372, Ove: 361, Nor: 254, Und: 13   FALSE

Variable type: integer 
 variable missing complete    n      mean      sd    p0      p25 median
      age       0     1000 1000    61.01    11.87    41    51        61
  antacid       0     1000 1000     0.14     0.35     0     0         0
   b12def       0     1000 1000     0.048    0.21     0     0         0
     etoh       0     1000 1000     0.74     0.44     0     0         1
       id       0     1000 1000 78566.26  2937.18 73559 75995.25  78582
   smk100       0     1000 1000     0.45     0.5      0     0         0
  vit_b12       0     1000 1000   875.15   943.18   154   481       669
      p75  p100     hist
    70       80 ▆▅▆▇▇▆▅▇
     0        1 ▇▁▁▁▁▁▁▁
     0        1 ▇▁▁▁▁▁▁▁
     1        1 ▃▁▁▁▁▁▁▇
 81011.5  83721 ▇▇▇▇▇▇▇▇
     1        1 ▇▁▁▁▁▁▁▆
   933.25 12494 ▇▁▁▁▁▁▁▁

Variable type: numeric 
  variable missing complete    n    mean      sd     p0   p25 median
 b12_trans       0     1000 1000   0.039   0.011 0.0089 0.033  0.039
  diet_b12       0     1000 1000   4.76    4.11  0.16   2.49   3.8  
  supp_b12       0     1000 1000 256.76  811.31  0.16   6     25    
    p75      p100     hist
  0.046     0.081 ▁▂▆▇▅▁▁▁
  5.86     54.41  ▇▂▁▁▁▁▁▁
 75     15100     ▇▁▁▁▁▁▁▁
# Creating model 1 with nonlinear terms
dd <- datadist(b12_final)
options(datadist = "dd")

model_01 <- lrm(b12def ~ age + smk100 + bmi_cat + etoh + antacid + rcs(supp_b12,3) + rcs(diet_b12,3) + smk100 %ia% bmi_cat, data = b12_final, x = TRUE, y = TRUE)

model_01
Logistic Regression Model
 
 lrm(formula = b12def ~ age + smk100 + bmi_cat + etoh + antacid + 
     rcs(supp_b12, 3) + rcs(diet_b12, 3) + smk100 %ia% bmi_cat, 
     data = b12_final, x = TRUE, y = TRUE)
 
                     Model Likelihood     Discrimination    Rank Discrim.    
                        Ratio Test           Indexes           Indexes       
 Obs         1000    LR chi2     21.09    R2       0.065    C       0.697    
  0           952    d.f.           14    g        0.889    Dxy     0.394    
  1            48    Pr(> chi2) 0.0993    gr       2.432    gamma   0.394    
 max |deriv|    1                         gp       0.034    tau-a   0.036    
                                          Brier    0.045                     
 
                              Coef    S.E.    Wald Z Pr(>|Z|)
 Intercept                    -1.9845  1.0060 -1.97  0.0485  
 age                          -0.0061  0.0131 -0.46  0.6429  
 smk100                        0.7837  0.5560  1.41  0.1586  
 bmi_cat=Obese                 0.4633  0.5280  0.88  0.3802  
 bmi_cat=Overweight           -0.6434  0.6585 -0.98  0.3286  
 bmi_cat=Underweight          -4.4729 21.6512 -0.21  0.8363  
 etoh                          0.3036  0.3948  0.77  0.4419  
 antacid                      -0.3077  0.4884 -0.63  0.5287  
 supp_b12                     -0.0036  0.0023 -1.54  0.1233  
 supp_b12'                     0.0547  0.0412  1.33  0.1839  
 diet_b12                     -0.2703  0.1325 -2.04  0.0414  
 diet_b12'                     0.2724  0.1751  1.56  0.1198  
 smk100 * bmi_cat=Obese       -0.7407  0.7159 -1.03  0.3009  
 smk100 * bmi_cat=Overweight  -0.0654  0.8433 -0.08  0.9382  
 smk100 * bmi_cat=Underweight  5.2158 21.6812  0.24  0.8099  
 
anova(model_01)
                Wald Statistics          Response: b12def 

 Factor                                          Chi-Square d.f. P     
 age                                              0.22       1   0.6429
 smk100  (Factor+Higher Order Factors)            3.24       4   0.5183
  All Interactions                                1.40       3   0.7056
 bmi_cat  (Factor+Higher Order Factors)           6.23       6   0.3983
  All Interactions                                1.40       3   0.7056
 etoh                                             0.59       1   0.4419
 antacid                                          0.40       1   0.5287
 supp_b12                                         4.15       2   0.1257
  Nonlinear                                       1.77       1   0.1839
 diet_b12                                         5.47       2   0.0650
  Nonlinear                                       2.42       1   0.1198
 smk100 * bmi_cat  (Factor+Higher Order Factors)  1.40       3   0.7056
 TOTAL NONLINEAR                                  4.12       2   0.1272
 TOTAL NONLINEAR + INTERACTION                    5.61       5   0.3457
 TOTAL                                           18.13      14   0.2010
plot(anova(model_01))

Based on this ANOVA and summary statistics above, none of the candidate main effects or the nonlinear terms appear to have a statistically significant impact in predicting vitamin B12 deficiency. Therefore, I will not use nonlinear predictors in this analysis.

11.3 Model 2 - Kitchen sink model with main effects only, evaluate variable selection with stepwise regression

# LRM fit
dd <- datadist(b12_final)
options(datadist = "dd")

model_ks_lrm <- lrm(b12def ~ age + smk100 + bmi_cat + etoh + antacid + supp_b12 + diet_b12, data = b12_final, x= T, y = T)
model_ks_lrm
Logistic Regression Model
 
 lrm(formula = b12def ~ age + smk100 + bmi_cat + etoh + antacid + 
     supp_b12 + diet_b12, data = b12_final, x = T, y = T)
 
                      Model Likelihood     Discrimination    Rank Discrim.    
                         Ratio Test           Indexes           Indexes       
 Obs          1000    LR chi2     15.34    R2       0.048    C       0.658    
  0            952    d.f.            9    g        0.804    Dxy     0.316    
  1             48    Pr(> chi2) 0.0820    gr       2.235    gamma   0.317    
 max |deriv| 0.005                         gp       0.029    tau-a   0.029    
                                           Brier    0.045                     
 
                     Coef    S.E.   Wald Z Pr(>|Z|)
 Intercept           -2.2676 0.9234 -2.46  0.0141  
 age                 -0.0065 0.0130 -0.50  0.6189  
 smk100               0.4509 0.3180  1.42  0.1562  
 bmi_cat=Obese        0.0281 0.3537  0.08  0.9366  
 bmi_cat=Overweight  -0.6699 0.4081 -1.64  0.1007  
 bmi_cat=Underweight  0.4213 1.0900  0.39  0.6991  
 etoh                 0.2607 0.3909  0.67  0.5048  
 antacid             -0.3574 0.4866 -0.73  0.4627  
 supp_b12            -0.0007 0.0005 -1.54  0.1244  
 diet_b12            -0.0966 0.0579 -1.67  0.0952  
 
# GLM fit for stepwise regression
model_ks_glm <- glm(b12def ~ age + smk100 + bmi_cat + etoh + antacid + supp_b12 + diet_b12, data = b12_final, family = binomial)
summary(model_ks_glm)

Call:
glm(formula = b12def ~ age + smk100 + bmi_cat + etoh + antacid + 
    supp_b12 + diet_b12, family = binomial, data = b12_final)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5208  -0.3592  -0.2938  -0.2236   2.8189  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)  
(Intercept)        -2.2675525  0.9233691  -2.456   0.0141 *
age                -0.0064595  0.0129853  -0.497   0.6189  
smk100              0.4509053  0.3180127   1.418   0.1562  
bmi_catObese        0.0281224  0.3536579   0.080   0.9366  
bmi_catOverweight  -0.6698961  0.4080915  -1.642   0.1007  
bmi_catUnderweight  0.4213010  1.0900244   0.387   0.6991  
etoh                0.2606947  0.3908961   0.667   0.5048  
antacid            -0.3573551  0.4865757  -0.734   0.4627  
supp_b12           -0.0007386  0.0004806  -1.537   0.1244  
diet_b12           -0.0965625  0.0578693  -1.669   0.0952 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 385.17  on 999  degrees of freedom
Residual deviance: 369.83  on 990  degrees of freedom
AIC: 389.83

Number of Fisher Scoring iterations: 7
step(model_ks_glm)
Start:  AIC=389.83
b12def ~ age + smk100 + bmi_cat + etoh + antacid + supp_b12 + 
    diet_b12

           Df Deviance    AIC
- age       1   370.07 388.07
- bmi_cat   3   374.27 388.27
- etoh      1   370.29 388.29
- antacid   1   370.41 388.41
<none>          369.83 389.83
- smk100    1   371.86 389.86
- diet_b12  1   373.37 391.37
- supp_b12  1   373.82 391.82

Step:  AIC=388.07
b12def ~ smk100 + bmi_cat + etoh + antacid + supp_b12 + diet_b12

           Df Deviance    AIC
- bmi_cat   3   374.57 386.57
- etoh      1   370.67 386.67
- antacid   1   370.68 386.68
- smk100    1   371.98 387.98
<none>          370.07 388.07
- diet_b12  1   373.64 389.64
- supp_b12  1   374.23 390.23

Step:  AIC=386.57
b12def ~ smk100 + etoh + antacid + supp_b12 + diet_b12

           Df Deviance    AIC
- etoh      1   374.94 384.94
- antacid   1   375.18 385.18
<none>          374.57 386.57
- smk100    1   376.62 386.62
- diet_b12  1   378.23 388.23
- supp_b12  1   378.32 388.32

Step:  AIC=384.94
b12def ~ smk100 + antacid + supp_b12 + diet_b12

           Df Deviance    AIC
- antacid   1   375.50 383.50
<none>          374.94 384.94
- smk100    1   377.84 385.84
- diet_b12  1   378.45 386.45
- supp_b12  1   378.72 386.72

Step:  AIC=383.5
b12def ~ smk100 + supp_b12 + diet_b12

           Df Deviance    AIC
<none>          375.50 383.50
- smk100    1   378.44 384.44
- diet_b12  1   379.11 385.11
- supp_b12  1   379.21 385.21

Call:  glm(formula = b12def ~ smk100 + supp_b12 + diet_b12, family = binomial, 
    data = b12_final)

Coefficients:
(Intercept)       smk100     supp_b12     diet_b12  
 -2.7169067    0.5116181   -0.0007194   -0.0964161  

Degrees of Freedom: 999 Total (i.e. Null);  996 Residual
Null Deviance:      385.2 
Residual Deviance: 375.5    AIC: 383.5

The model indicated by backwards stepwise elimination is the following:

b12def ~ smk100 + supp_b12 + diet_b12

Therefore, I will compare this model (model 3) to the kitchen sink model via various tests below.

dd <- datadist(b12_final)
options(datadist = "dd")

model_03_lrm <- lrm(b12def ~ smk100 + supp_b12 + diet_b12, data=b12_final, x = TRUE, y = TRUE)

model_03_glm <- glm(b12def ~ smk100 + supp_b12 + diet_b12, data=b12_final, family = binomial)

11.4 Comparing kitchen sink model to model 3 (smoking, supplementary and dietary B12)

11.4.1 ANOVA

Based on the output below, it does not appear that model 3 improves upon the kitchen sink model via ANOVA.

anova(model_ks_glm, model_03_glm)
Analysis of Deviance Table

Model 1: b12def ~ age + smk100 + bmi_cat + etoh + antacid + supp_b12 + 
    diet_b12
Model 2: b12def ~ smk100 + supp_b12 + diet_b12
  Resid. Df Resid. Dev Df Deviance
1       990     369.83            
2       996     375.50 -6  -5.6701
pchisq(5.6701,6, lower.tail=FALSE)
[1] 0.4611408

11.4.2 AIC and BIC

On AIC/BIC evaluation, Model 3 has lower AIC and BIC than the kitchen sink model.

glance(model_ks_glm)
  null.deviance df.null    logLik      AIC      BIC deviance df.residual
1      385.1674     999 -184.9134 389.8268 438.9044 369.8268         990
glance(model_03_glm)
  null.deviance df.null    logLik     AIC     BIC deviance df.residual
1      385.1674     999 -187.7485 383.497 403.128  375.497         996

11.4.3 ROC Curves/C-statistic

# ROC curve for kitchen sink model (AUC = 0.658)
roc_model_ks <- roc(b12_final$b12def ~ predict(model_ks_glm, type = "response"), ci = TRUE)
roc_model_ks

Call:
roc.formula(formula = b12_final$b12def ~ predict(model_ks_glm,     type = "response"), ci = TRUE)

Data: predict(model_ks_glm, type = "response") in 952 controls (b12_final$b12def 0) < 48 cases (b12_final$b12def 1).
Area under the curve: 0.6579
95% CI: 0.5856-0.7303 (DeLong)
plot(roc_model_ks)

# ROC curve for model 3 (AUC = 0.616)
roc_model_03<- roc(b12_final$b12def ~ predict(model_03_glm, type = "response"), ci = TRUE)
roc_model_03

Call:
roc.formula(formula = b12_final$b12def ~ predict(model_03_glm,     type = "response"), ci = TRUE)

Data: predict(model_03_glm, type = "response") in 952 controls (b12_final$b12def 0) < 48 cases (b12_final$b12def 1).
Area under the curve: 0.6155
95% CI: 0.5389-0.6921 (DeLong)
plot(roc_model_03)

11.4.4 Nagelkerke’s R-squared

Based on the LRM output, Nagelkerke’s R2 is 0.048 for the kitchen sink model, and 0.030 for model 3. Neither of the values are particularly good, but the kitchen sink model appears to be better in this regard.

model_ks_lrm
Logistic Regression Model
 
 lrm(formula = b12def ~ age + smk100 + bmi_cat + etoh + antacid + 
     supp_b12 + diet_b12, data = b12_final, x = T, y = T)
 
                      Model Likelihood     Discrimination    Rank Discrim.    
                         Ratio Test           Indexes           Indexes       
 Obs          1000    LR chi2     15.34    R2       0.048    C       0.658    
  0            952    d.f.            9    g        0.804    Dxy     0.316    
  1             48    Pr(> chi2) 0.0820    gr       2.235    gamma   0.317    
 max |deriv| 0.005                         gp       0.029    tau-a   0.029    
                                           Brier    0.045                     
 
                     Coef    S.E.   Wald Z Pr(>|Z|)
 Intercept           -2.2676 0.9234 -2.46  0.0141  
 age                 -0.0065 0.0130 -0.50  0.6189  
 smk100               0.4509 0.3180  1.42  0.1562  
 bmi_cat=Obese        0.0281 0.3537  0.08  0.9366  
 bmi_cat=Overweight  -0.6699 0.4081 -1.64  0.1007  
 bmi_cat=Underweight  0.4213 1.0900  0.39  0.6991  
 etoh                 0.2607 0.3909  0.67  0.5048  
 antacid             -0.3574 0.4866 -0.73  0.4627  
 supp_b12            -0.0007 0.0005 -1.54  0.1244  
 diet_b12            -0.0966 0.0579 -1.67  0.0952  
 
model_03_lrm
Logistic Regression Model
 
 lrm(formula = b12def ~ smk100 + supp_b12 + diet_b12, data = b12_final, 
     x = TRUE, y = TRUE)
 
                      Model Likelihood     Discrimination    Rank Discrim.    
                         Ratio Test           Indexes           Indexes       
 Obs          1000    LR chi2      9.67    R2       0.030    C       0.616    
  0            952    d.f.            3    g        0.642    Dxy     0.231    
  1             48    Pr(> chi2) 0.0216    gr       1.901    gamma   0.232    
 max |deriv| 0.002                         gp       0.023    tau-a   0.021    
                                           Brier    0.045                     
 
           Coef    S.E.   Wald Z Pr(>|Z|)
 Intercept -2.7169 0.3166 -8.58  <0.0001 
 smk100     0.5116 0.2995  1.71  0.0876  
 supp_b12  -0.0007 0.0005 -1.49  0.1362  
 diet_b12  -0.0964 0.0572 -1.69  0.0916  
 

11.4.5 Validating summary statistics

For the kitchen sink model, the corrected Nagelkerke’s R2 is 0.001, which is worse than the original estimate of 0.048. The corrected C-statistic is 0.596, which is decreased from the original value of 0.658.

For model 3, the corrected Nagelkerke’s R2 is 0.005, which is again much worse than an already poor estimate of 0.030. The corrected C-statistic is 0.580, which is again less than the original value of 0.616.

However, model 3 marginally outperforms the kitchen sink model here.

validate(model_ks_lrm)
          index.orig training    test optimism index.corrected  n
Dxy           0.3158   0.3863  0.2623   0.1239          0.1919 40
R2            0.0476   0.0744  0.0277   0.0466          0.0010 40
Intercept     0.0000   0.0000 -1.3536   1.3536         -1.3536 40
Slope         1.0000   1.0000  0.5285   0.4715          0.5285 40
Emax          0.0000   0.0000  0.4575   0.4575          0.4575 40
D             0.0143   0.0230  0.0079   0.0151         -0.0007 40
U            -0.0020  -0.0020  0.0059  -0.0079          0.0059 40
Q             0.0163   0.0250  0.0020   0.0230         -0.0066 40
B             0.0451   0.0446  0.0455  -0.0009          0.0460 40
g             0.8044   1.1368  0.5670   0.5698          0.2346 40
gp            0.0294   0.0358  0.0212   0.0145          0.0149 40
validate(model_03_lrm)
          index.orig training    test optimism index.corrected  n
Dxy           0.2310   0.3093  0.2283   0.0810          0.1500 40
R2            0.0301   0.0501  0.0247   0.0254          0.0047 40
Intercept     0.0000   0.0000 -0.8609   0.8609         -0.8609 40
Slope         1.0000   1.0000  0.7065   0.2935          0.7065 40
Emax          0.0000   0.0000  0.2718   0.2718          0.2718 40
D             0.0087   0.0152  0.0069   0.0082          0.0004 40
U            -0.0020  -0.0020  0.0101  -0.0121          0.0101 40
Q             0.0107   0.0172 -0.0032   0.0203         -0.0097 40
B             0.0453   0.0461  0.0456   0.0006          0.0448 40
g             0.6424   1.3365  0.5654   0.7710         -0.1286 40
gp            0.0227   0.0285  0.0197   0.0088          0.0139 40

\[ C = 0.5 + \frac{Dxy}{2} = 0.5 + \frac{.1917}{2} = 0.5 + 0.0959 = 0.596 \] \[ C = 0.5 + \frac{Dxy}{2} = 0.5 + \frac{.1601}{2} = 0.5 + 0.0801 = 0.580 \]

11.4.6 Comparing calibration of models

Based on the calibration plots below, neither model appears to be well calibrated in predicting vitamin B12 deficiency. It appears that model 3 may again marginally outperform the kitchen sink model here, as the bias-corrected line stays closer to the ideal line.

plot(calibrate(model_ks_lrm))


n=1000   Mean absolute error=0.007   Mean squared error=1e-04
0.9 Quantile of absolute error=0.009
plot(calibrate(model_03_lrm))


n=1000   Mean absolute error=0.005   Mean squared error=4e-05
0.9 Quantile of absolute error=0.01

11.4.7 Choosing final model

Model C-statistic Nagelkerke’s R2 AIC BIC Brier score
Kitchen sink 0.658 0.048 389.8 438.9 0.045
Model 3 0.616 0.030 383.5 403.1 0.045

Based on the summary of comparision information above, I choose model 3 as my final model. Although AIC and BIC were lower for model 3, I felt its performance in terms of discrimination and calibration were consistently, though marginally, superior to the kitchen sink model.

11.5 Validating and Calibrating Final Model (model 3)

The validation below indicates a corrected C-statistic of 0.586 and Nagelkerke’s R-squared of 0.0128. These measures both indicate overall poor discriminatory ability.

set.seed(432321)
validate(model_03_lrm, B = 100)
          index.orig training    test optimism index.corrected   n
Dxy           0.2310   0.2873  0.2287   0.0586          0.1724 100
R2            0.0301   0.0420  0.0247   0.0173          0.0128 100
Intercept     0.0000   0.0000 -0.6975   0.6975         -0.6975 100
Slope         1.0000   1.0000  0.7565   0.2435          0.7565 100
Emax          0.0000   0.0000  0.2167   0.2167          0.2167 100
D             0.0087   0.0126  0.0069   0.0057          0.0030 100
U            -0.0020  -0.0020  0.0026  -0.0046          0.0026 100
Q             0.0107   0.0146  0.0043   0.0103          0.0003 100
B             0.0453   0.0454  0.0455  -0.0001          0.0454 100
g             0.6424   0.9133  0.5625   0.3508          0.2916 100
gp            0.0227   0.0261  0.0199   0.0063          0.0165 100
plot(calibrate(model_03_lrm))


n=1000   Mean absolute error=0.007   Mean squared error=5e-05
0.9 Quantile of absolute error=0.009

As demonstrated previously, the calibration of the final model is also suboptimal. The bias-corrected line deviates from the ideal line, and whether it is over or under depends upon the value of the predicted probability. Based on this information, I would use caution in trusting the predictions made by this model.

inf <- which.influence(model_03_lrm, cutoff = 0.3)
show.influence(object = inf, dframe = data.frame(b12_final))
    Count supp_b12 diet_b12
111     1    *1400    1.215
190     1    *1000    2.015
232     1    *1000    1.885
490     1    *1000    1.475
984     1       12  *11.635
plot(model_03_glm, which=5)

In evaluating influence, there appear to be 4 values of supp_b12 and 1 value of diet_b12 that are particularly influential on the model. However, none of these have a Cook’s distance >0.5.

11.6 Summarizing Final Model

The final model, model 3, is as follows:

Log odds of vitamin B12 deficiency = -2.72 + 0.51(smk100) - 0.0007(supp_b12) - 0.096(diet_b12)

The odds ratios from the final model, as indicated by the GLM model, are as follows. All odds ratios are reported assuming the other covariates in the model are held constant.

  • Compared to those who did not smoke at least 100 cigarettes in their lifetime, those who smoked at least 100 cigarettes had 1.668 times greater the odds (95% CI 0.93 to 3.03) of having B12 deficiency. However, this result was not statistically significant.

  • The odds ratio of B12 deficiency associated with a 1 mcg increase in supplementary B12 was 0.999 (95% CI 0.998 to 1.00), and this was not statistically significant.

  • The odds ratio of B12 deficiency associated with a 1 mcg increase in dietary B12 was 0.908 (95% CI 0.80 to 1.00), but this was not statistically significant

Based on the LRM model, which is also reflected in the graph below:

  • Compared to those who did not smoke at least 100 cigarettes in their lifetime, those who smoked at least 100 cigarettes had 1.668 times greater the odds (95% CI 0.93 to 3.00) of having B12 deficiency. However, this result was not statistically significant

  • The odds ratio of B12 deficiency associated with an increase from the 25th percentile (6.0 mcg) to 75th percentile (75.0 mcg) of supplementary B12 was 0.95 (95% CI 0.89 to 1.02), and this was not statistically significant. In other words, increasing supplementary B12 intake was associated with a non-significant reduction in the odds of B12 deficiency.

  • The odds ratio of B12 deficiency associated with an increase from the 25th percentile (2.49 mcg) to 75th percentile (5.86 mcg) of dietary B12 intake was 0.72 (95% CI 0.50 to 1.05), and this was not statistically significant. In other words, increasing dietary B12 intake was associated with a non-significant reduction in the odds of B12 deficiency.

summary(model_03_glm)

Call:
glm(formula = b12def ~ smk100 + supp_b12 + diet_b12, family = binomial, 
    data = b12_final)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.4527  -0.3502  -0.3043  -0.2570   2.6855  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.7169067  0.3166447  -8.580   <2e-16 ***
smk100       0.5116181  0.2995068   1.708   0.0876 .  
supp_b12    -0.0007194  0.0004827  -1.490   0.1362    
diet_b12    -0.0964161  0.0571548  -1.687   0.0916 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 385.17  on 999  degrees of freedom
Residual deviance: 375.50  on 996  degrees of freedom
AIC: 383.5

Number of Fisher Scoring iterations: 7
exp(coef(model_03_glm))
(Intercept)      smk100    supp_b12    diet_b12 
 0.06607884  1.66798804  0.99928090  0.90808605 
exp(confint(model_03_glm))
                 2.5 %    97.5 %
(Intercept) 0.03512361 0.1215716
smk100      0.93014116 3.0307757
supp_b12    0.99808267 1.0000080
diet_b12    0.80364933 1.0024600
summary(model_03_lrm)
             Effects              Response : b12def 

 Factor      Low    High    Diff. Effect    S.E.     Lower 0.95 Upper 0.95
 smk100      0.0000  1.0000  1.00  0.511620 0.299510 -0.075405  1.098600  
  Odds Ratio 0.0000  1.0000  1.00  1.668000       NA  0.927370  3.000100  
 supp_b12    6.0000 75.0000 69.00 -0.049636 0.033309 -0.114920  0.015649  
  Odds Ratio 6.0000 75.0000 69.00  0.951580       NA  0.891440  1.015800  
 diet_b12    2.4925  5.8625  3.37 -0.324920 0.192610 -0.702430  0.052590  
  Odds Ratio 2.4925  5.8625  3.37  0.722580       NA  0.495380  1.054000  
plot(summary(model_03_lrm))

11.7 Making Predictions - Nomogram

plot(nomogram(model_03_lrm, fun = plogis,
              fun.at=c(0.05, seq(0.1, 0.9, by = 0.1), 0.95),
              funlabel="Pr(low = 1)"))

11.8 Making predictions - Graphs

**** add more here *****

ggplot(Predict(model_03_lrm))

# on probability scale
ggplot(Predict(model_03_lrm, fun = plogis))

12 Task 12: Discussion/Conclusions

My project was intended to evaluate the effects of smoking on vitamin B12 deficiency, adjusting for known risk factors and confounders of vitamin B12 deficiency (particularly age, antacid use, and dietary/supplementary B12 intake). Linear regression modeling revealed that although smoking was not statistically significantly associated with serum B12 level, supplemental B12 intake, alcohol intake, and age were.

%%% put in directions of associations once transformation figured out %%%

Based upon the logistic regression, none of the hypothesized covariates appeared to have statistically significant effects on B12 deficiency, particularly the main predictor of interest, smoking.

Although the overall sample size was adequate (n = 1000), the study was somewhat limited by a small proportion of patients who had the outcome of interest for the logistic regression model (n = 48), even when the threshold for vitamin B12 deficiency was increased to 300 pg/ml. This issue likely limited my ability to fully evaluate the effects of the predictors of interest. Another issue with this study was that I utilized a complete case analysis, based on the available data from NHANES. Again, although the sample size was adequate, the study was likely susceptible to selection bias which may have resulted in certain patients (most concerning of whom are those with the outcome of interest) being excluded from the analysis. If I had to redo this project, I would have tried to include more patients with the outcome of interest and utilized imputation strategies to have a more representative sample of the NHANES population.

Laura Baldassari

2018-03-07 21:39:57